4.9 Resampled Paired t-Test

Also known as the k-hold-out paired t-test.

a popular method for comparing the performance of two models (classifiers or regressors); however, this method has many drawbacks and is not recommended to be used in practice as Dietterich noted

In short: a popular way to compare the performance of two models, but it has many drawbacks and, as Dietterich noted, is not recommended for practical use.

The procedure, illustrated with an example:

Classifiers C_1 and C_2

Labeled dataset S

Split S into a training set (2/3) and a test set (1/3)

In the resampled paired t-test procedure, we repeat this splitting procedure (with typically 2/3 training data and 1/3 test data) k times (usually 30 or more).

In other words, in the resampled paired t-test the split-and-train step is repeated k times (30 or more).

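In each round, C_1 and C_2 are both trained on the same training set and evaluated on the same test set

This yields k differences

Assumption: the k differences are drawn independently and approximately follow a normal distribution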
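The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data generator and the two stand-in "classifiers" (`fit_majority`, `fit_threshold`) are hypothetical and not part of the source; only the repeated 2/3 vs. 1/3 split and the per-round accuracy difference follow the procedure described in these notes.

```python
import random
import statistics


def make_toy_data(n, rng):
    """Hypothetical 1-D toy data: the label is 1 when x plus noise is positive."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x + rng.gauss(0.0, 0.3) > 0 else 0
        data.append((x, y))
    return data


def fit_majority(train):
    """Stand-in for C_1: always predicts the most frequent training label."""
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label


def fit_threshold(train):
    """Stand-in for C_2: thresholds x at the midpoint of the two class means.

    Assumes both classes occur in the training set and class 1 has the larger mean.
    """
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    cut = (mean0 + mean1) / 2.0
    return lambda x: 1 if x > cut else 0


def accuracy(model, test):
    """Fraction of test examples the model labels correctly."""
    return sum(model(x) == y for x, y in test) / len(test)


def resampled_differences(data, fit_1, fit_2, k=30, train_fraction=2 / 3, seed=0):
    """Repeat the random split k times and collect ACC_i = ACC_(i,C_1) - ACC_(i,C_2)."""
    rng = random.Random(seed)
    n_train = int(round(len(data) * train_fraction))
    diffs = []
    for _ in range(k):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        # Both models see the same training set and the same test set.
        c1, c2 = fit_1(train), fit_2(train)
        diffs.append(accuracy(c1, test) - accuracy(c2, test))
    return diffs


data = make_toy_data(300, random.Random(42))
diffs = resampled_differences(data, fit_majority, fit_threshold, k=30)
```

Every round reshuffles the same dataset S, so the k test sets are not disjoint; this matters for the independence assumption stated above.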

we can compute the following t statistic with k − 1 degrees of freedom according to Student’s t-test,

That is, a t statistic with k − 1 degrees of freedom is computed following Student's t-test.

Null hypothesis: "C_1 and C_2 have equal performance"

ACC_i = ACC_(i,C_1) - ACC_(i,C_2)

The difference between the accuracies (ACC) of C_1 and C_2 in the i-th iteration

ACC_avg is the mean of the ACC_i
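The quoted passage refers to "the following t statistic", but the formula itself is not reproduced in these notes. With ACC_i and ACC_avg as defined above, the standard paired t statistic takes the form below. This is a reconstruction from the textbook paired t-test, not a copy from the source:

```latex
t = \frac{\mathrm{ACC}_{avg}\,\sqrt{k}}
         {\sqrt{\sum_{i=1}^{k} \left(\mathrm{ACC}_i - \mathrm{ACC}_{avg}\right)^2 / (k-1)}},
\qquad
\mathrm{ACC}_{avg} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{ACC}_i
```

The denominator is the sample standard deviation of the k differences, so t is the mean difference divided by its estimated standard error.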

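Once the statistic t (Student's t) has been computed, compute the p-value and compare it with α

If the p-value is smaller than α, the null hypothesis is rejected: there is a significant difference between C_1 and C_2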
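A minimal sketch of this decision step, using only the Python standard library. The function names and the ten example differences are hypothetical, made up purely for illustration. The p-value is obtained by numerically integrating the Student t density, which is one possible approach rather than the method of any particular library.

```python
import math
import statistics


def paired_t_statistic(diffs):
    """t = ACC_avg * sqrt(k) / s, where s is the sample std (k - 1 in the denominator)."""
    k = len(diffs)
    return statistics.mean(diffs) * math.sqrt(k) / statistics.stdev(diffs)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t, df, steps=20000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * t_pdf(j * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


# Hypothetical accuracy differences ACC_i from k = 10 rounds (illustration only).
example_diffs = [0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.06, 0.01, 0.03, 0.02]
alpha = 0.05

t_stat = paired_t_statistic(example_diffs)
p_value = two_sided_p_value(t_stat, df=len(example_diffs) - 1)
reject_null = p_value < alpha
```

If SciPy is available, `scipy.stats.ttest_rel` applied to the two per-round accuracy lists should give the same statistic and p-value without the hand-written integration.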

Problems

it violates the assumptions of Student’s t-test, as the differences of the model performances are not normally distributed because the accuracies are not independent

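That is, the procedure violates the assumptions of Student's t-test: because the accuracies are not independent, the differences in model performance do not follow a normal distribution.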

Also, the differences between the accuracies themselves are also not independent since the test sets overlap upon resampling

That is, because the test sets overlap across resampling rounds, the accuracy differences themselves are not independent either.
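The overlap is easy to see in a small simulation. The sketch below uses a hypothetical helper name and arbitrary sizes. It draws k random test sets of size n/3 from the same n examples and measures how much each pair shares. Two random test sets of this size share about one third of their elements on average, and with four or more such test sets some overlap is unavoidable.

```python
import random


def average_pairwise_overlap(n=300, test_fraction=1 / 3, k=30, seed=0):
    """Mean fraction of a test set shared with another round's test set."""
    rng = random.Random(seed)
    n_test = int(round(n * test_fraction))
    # One random test set (as a set of example indices) per resampling round.
    test_sets = [set(rng.sample(range(n), n_test)) for _ in range(k)]
    overlaps = []
    for a in range(k):
        for b in range(a + 1, k):
            overlaps.append(len(test_sets[a] & test_sets[b]) / n_test)
    return sum(overlaps) / len(overlaps)


mean_overlap = average_pairwise_overlap()
```

Since the same examples are scored again and again, the k accuracy differences share information, which is the dependence described above.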

Implementation: paired_ttest_resample: Resampled paired t test

Note, however, that it is intended for comparison studies only